Ancient Text


InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling

Diao, Xiaolei, Zhou, Zhihan, Shi, Lida, Wang, Ting, Qi, Ruihua, Xu, Hao, Shi, Daqian

arXiv.org Artificial Intelligence

Constructing historical language models (LMs) plays a crucial role in aiding archaeological provenance studies and understanding ancient cultures. However, existing resources present major challenges for training effective LMs on historical texts. First, the scarcity of historical language samples renders unsupervised learning approaches based on large text corpora highly inefficient, hindering effective pre-training. Moreover, due to the considerable temporal gap and complex evolution of ancient scripts, the absence of comprehensive character encoding schemes limits the digitization and computational processing of ancient texts, particularly in early Chinese writing. To address these challenges, we introduce InteChar, a unified and extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese. InteChar enables consistent digitization and representation of historical texts, providing a foundation for robust modeling of ancient scripts. To evaluate the effectiveness of InteChar, we construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation, centered on Chinese oracle bone inscriptions. Extensive experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks, confirming the effectiveness of our approach and establishing a solid foundation for future research in ancient Chinese NLP.
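
The core idea of a unified, extensible character list can be illustrated with a small sketch: unencoded glyphs are assigned placeholder code points (here, from the Unicode Private Use Area) so they can coexist with already-encoded traditional and modern Chinese characters in one inventory. The glyph identifiers and PUA mapping below are illustrative assumptions, not InteChar's actual scheme.

```python
# Illustrative sketch: merge encoded characters with Private Use Area (PUA)
# placeholders for glyphs that have no Unicode assignment yet. The glyph
# IDs and mapping are hypothetical, not InteChar's actual scheme.
PUA_START = 0xE000  # BMP Private Use Area: U+E000..U+F8FF

def build_char_list(encoded_chars, unencoded_glyph_ids):
    """Map each character or glyph ID to a single code point,
    assigning consecutive PUA code points to unencoded glyphs."""
    char_list = {ch: ch for ch in encoded_chars}
    for i, glyph_id in enumerate(unencoded_glyph_ids):
        char_list[glyph_id] = chr(PUA_START + i)
    return char_list

# Encoded characters plus two hypothetical oracle-bone glyph
# identifiers from an excavation catalogue.
chars = build_char_list(["人", "馬", "日"], ["OBC-0001", "OBC-0002"])
```

Once every glyph has a stable code point, downstream tokenizers and language models can treat oracle bone characters like any other character in the vocabulary.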


Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark

Smiley, David M.

arXiv.org Artificial Intelligence

Identifying parallel passages in biblical Hebrew (BH) is central to biblical scholarship for understanding intertextual relationships. Traditional methods rely on manual comparison, a labor-intensive process prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between Samuel/Kings and Chronicles, I assessed each model's capability to generate word embeddings distinguishing parallel from non-parallel passages. Using cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show promise; E5 excels in parallel detection, while AlephBERT demonstrates stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.
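
The evaluation strategy described above can be sketched in a few lines: compute cosine similarity for each passage pair, then compare the similarity distributions of parallel and non-parallel pairs with the Wasserstein distance. The embeddings below are random stand-ins; in practice they would come from a model such as E5 or AlephBERT.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Stand-in sentence embeddings (in practice, produced by E5, AlephBERT, etc.)
rng = np.random.default_rng(0)
parallel_a = rng.normal(0.0, 1.0, (10, 8))
parallel_b = parallel_a + rng.normal(0.0, 0.1, (10, 8))  # near-duplicates
unrelated = rng.normal(3.0, 1.0, (10, 8))

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairwise similarities for parallel vs. non-parallel passage pairs
sims_parallel = [cosine_sim(a, b) for a, b in zip(parallel_a, parallel_b)]
sims_unrelated = [cosine_sim(a, b) for a, b in zip(parallel_a, unrelated)]

# Wasserstein distance between the two similarity distributions:
# larger values mean the model separates parallels more cleanly.
separation = wasserstein_distance(sims_parallel, sims_unrelated)
```

A model whose parallel-pair similarities sit far from its non-parallel-pair similarities (large `separation`) is better suited to flagging candidate parallels for scholarly review.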


Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model

Li, Shen, Hu, Renfen, Wang, Lijun

arXiv.org Artificial Intelligence

General-purpose large language models demonstrate notable capabilities in language comprehension and generation, achieving results that are comparable to, or even surpass, human performance in many natural language processing tasks. Nevertheless, when general models are applied to some specific domains, e.g., Classical Chinese texts, their effectiveness is often unsatisfactory, and fine-tuning open-source foundational models similarly struggles to adequately incorporate domain-specific knowledge. To address this challenge, this study developed a large language model, AI Taiyan, specifically designed for understanding and generating Classical Chinese. Experiments show that with a reasonable model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters. In key tasks related to language processing of Classical Chinese such as punctuation, identification of allusions, explanation of word meanings, and translation between ancient and modern Chinese, this model exhibits a clear advantage over both general-purpose large models and domain-specific traditional models, achieving levels close to or surpassing human baselines. This research provides a reference for the efficient construction of specialized domain-specific large language models. Furthermore, the paper discusses the application of this model in fields such as the collation of ancient texts, dictionary editing, and language research, combined with case studies.


Automating Violence Detection and Categorization from Ancient Texts

Abdelhalim, Alhassan, Regneri, Michaela

arXiv.org Artificial Intelligence

Violence descriptions in literature offer valuable insights for a wide range of research in the humanities. For historians, depictions of violence are of special interest for analyzing the societal dynamics surrounding large wars and individual conflicts of influential people. Harvesting data for violence research manually is laborious and time-consuming. This study is the first to evaluate the effectiveness of large language models (LLMs) in identifying violence in ancient texts and categorizing it across multiple dimensions. Our experiments identify LLMs as a valuable tool to scale up the accurate analysis of historical texts and show the effect of fine-tuning and data augmentation, yielding an F1-score of up to 0.93 for violence detection and 0.86 for fine-grained violence categorization.
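
For readers unfamiliar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The confusion-matrix counts below are made up for illustration; they are not the paper's actual results.

```python
# F1 = harmonic mean of precision and recall.
# The counts here are illustrative, not the paper's confusion matrix.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)  # fraction of flagged passages that are violent
    recall = tp / (tp + fn)     # fraction of violent passages that were flagged
    return 2 * precision * recall / (precision + recall)

score = f1_score(tp=90, fp=5, fn=9)  # roughly 0.93 with these made-up counts
```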


Productivity Is a Drag. Work Is Divine.

The Atlantic - Technology

Why should humans do anything, if machines can do it better? The answer is crucial to the future of human civilization--and may just lie in religious texts from centuries ago. From the digital (Google searches and Slack chats) to the purely mechanical (washing machines and microwaves), humans use tools nearly constantly to enhance or replace our own labor. Those that save time and effort are easy to appreciate--I have yet to meet someone who misses scrubbing clothes by hand. But the rapid rise of artificial intelligence--which can now write essays and poetry, create art, and substitute for human interaction--has scrambled the relationship between technology and labor.


Restoring Ancient Ideograph: A Multimodal Multitask Neural Network Approach

Duan, Siyu, Wang, Jun, Su, Qi

arXiv.org Artificial Intelligence

Cultural heritage serves as the enduring record of human thought and history. Despite significant efforts dedicated to the preservation of cultural relics, many ancient artefacts have been irreversibly ravaged by natural deterioration and human actions. Deep learning has emerged as a valuable tool for restoring various kinds of cultural heritage, including ancient texts. Previous research has approached ancient text restoration from either a visual or a textual perspective, often overlooking the potential of synergizing multimodal information. This paper proposes a novel Multimodal Multitask Restoring Model (MMRM) to restore ancient texts, with particular emphasis on ideographs. The model combines context understanding with residual visual information from damaged ancient artefacts, enabling it to predict damaged characters and generate restored images simultaneously. We tested the MMRM model through experiments on both simulated datasets and authentic ancient inscriptions. The results show that the proposed method gives insightful restoration suggestions in both simulation experiments and real-world scenarios. To the best of our knowledge, this work represents the pioneering application of multimodal deep learning to ancient text restoration, which will contribute to the understanding of ancient society and culture in the digital humanities.


Researchers use AI to decipher ancient Roman texts carbonized in deadly Mount Vesuvius eruption

FOX News

A set of ancient texts burned in the volcanic eruption of Mount Vesuvius in 79 A.D. has been deciphered thanks to a team of researchers using AI. The nearly 2,000-year-old texts were unreadable after being charred in a villa in Herculaneum, a Roman town near Pompeii. The villa is believed to have been owned by the father-in-law of Julius Caesar, and the texts were carbonized by the heat of the volcanic debris.


GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts

Wang, Dongbo, Liu, Chang, Zhao, Zhixiao, Shen, Si, Liu, Liu, Li, Bin, Hu, Haotian, Wu, Mengcheng, Lin, Litao, Zhao, Xue, Wang, Xiyu

arXiv.org Artificial Intelligence

In the context of the rapid development of large language models, we have meticulously trained and introduced the GujiBERT and GujiGPT language models, which are foundational models specifically designed for intelligent information processing of ancient texts. These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters, allowing them to effectively handle various natural language processing tasks related to ancient books, including but not limited to automatic sentence segmentation, punctuation, word segmentation, part-of-speech tagging, entity recognition, and automatic translation. Notably, these models have exhibited exceptional performance across a range of validation tasks using publicly available datasets. Our research findings highlight the efficacy of employing self-supervised methods to further train the models using classical text corpora, thus enhancing their capability to tackle downstream tasks. Moreover, it is worth emphasizing that the choice of font, the scale of the corpus, and the initial model selection all exert significant influence over the ultimate experimental outcomes. To cater to the diverse text processing preferences of researchers in digital humanities and linguistics, we have developed three distinct categories comprising a total of nine model variations. We believe that by sharing these foundational language models specialized in the domain of ancient texts, we can facilitate the intelligent processing and scholarly exploration of ancient literary works and, consequently, contribute to the global dissemination of China's rich and esteemed traditional culture in this new era.


SikuGPT: A Generative Pre-trained Model for Intelligent Information Processing of Ancient Texts from the Perspective of Digital Humanities

Liu, Chang, Wang, Dongbo, Zhao, Zhixiao, Hu, Die, Wu, Mengcheng, Lin, Litao, Shen, Si, Li, Bin, Liu, Jiangfeng, Zhang, Hai, Zhao, Lianzheng

arXiv.org Artificial Intelligence

The rapid advance of artificial intelligence technology has facilitated the prosperity of digital humanities research. Against this backdrop, research methods for the intelligent processing of ancient texts, a crucial component of digital humanities research, need to be transformed to adapt to new development trends in the wave of AIGC. In this study, we propose a GPT model called SikuGPT based on the corpus of the Siku Quanshu. The model's performance in tasks such as intralingual translation and text classification exceeds that of other GPT-type models aimed at processing ancient texts. SikuGPT's ability to process traditional Chinese ancient texts can help promote the organization of ancient information and knowledge services, as well as the international dissemination of ancient Chinese culture.


What Happened in Reinforcement Learning in 2022

#artificialintelligence

Just as we learn from our environment and our actions determine whether we are rewarded or punished, so do reinforcement learning agents, whose ultimate aim is to maximise rewards. This article brings together the top 8 reinforcement learning innovations that shaped AI across several industries in 2022. Alphabet's DeepMind collaborated with the University of Venice, the University of Oxford, and the Athens University of Economics and Business to build a deep neural network called 'Ithaca', which can restore missing text in ancient documents. In a paper published in Nature, DeepMind stated that Ithaca was trained using natural language processing (NLP) not only to recover lost ancient text that has been damaged over time but also to identify the original location of the text and establish the date when it was written. With DeepMind's latest release, AlphaTensor, an AI system based on a 3D board game, researchers shed light on a 50-year-old fundamental question in mathematics: finding the fastest way to multiply two matrices.